This tutorial will walk you through the entire data science pipeline: data collection/management, exploratory data analysis, hypothesis testing, machine learning, and the curation of a message derived from insights learned during the data analysis.
The dataset we will be working with is provided by the National Center for Health Statistics: “Leading Causes of Death: United States”. This dataset shows the death rates for the leading causes of death in the US over the years 1999-2015. Data is contained for each of the 50 states including the District of Columbia.
It is important to understand the attribute “Age-adjusted Death Rate”. Death rate is calculated by the formula (total deaths / total population) x 100,000. For example, let us say there were 1,000 deaths due to the flu in a state that has a popuation of 2,000,000. Our death rate would be (1,000 / 2,000,000) x 100,000 = 50. This rate of 50 means there were 50 deaths due to the flu per 100,000 people. This is the crude death rate, the dataset’s is age adjusted. Age-adjustment is a way of standardizing the death rate. A death rate can be distorted by a population. Suppose one state has a large eldery population, this state might have higher death rates simply because the eldery are more likely to die. The death rates are adjusted to have rates that would exist under a normal/standard age distribution.
Further reading about age-adjusted death rates: https://health.mo.gov/data/mica/CDP_MICA/AARate.html
Link to the dataset: https://data.cdc.gov/NCHS/NCHS-Leading-Causes-of-Death-United-States/bi63-dtpu
Our dataset was downloaded as a CSV (comma separated values) file, which is a common format for datasets. To read the file and transform it into a dataframe in R, we use the function read_csv. “deaths” now refers to our dataframe. We can see the data has 6 attributes or columns: Year, 113 Cause Name, Cause Name, State, Deaths, and Age-adjusted Death Rate.
library(tidyverse)
## -- Attaching packages -------------
## v ggplot2 2.2.1 v purrr 0.2.4
## v tibble 1.4.2 v dplyr 0.7.4
## v tidyr 0.7.2 v stringr 1.2.0
## v readr 1.1.1 v forcats 0.2.0
## -- Conflicts ----------------------
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
deaths <- read_csv("c:/users/Anastasiya/Documents/NCHS_-_Leading_Causes_of_Death__United_States.csv")
## Parsed with column specification:
## cols(
## Year = col_integer(),
## `113 Cause Name` = col_character(),
## `Cause Name` = col_character(),
## State = col_character(),
## Deaths = col_integer(),
## `Age-adjusted Death Rate` = col_double()
## )
deaths
Luckily, we are working with a very tidy and complete dataset. Each attribute forms a column, and each entity forms a row, although our dataset does need a few small tweaks.
Examples of untidy data:
Follow this link for more information regarding data tidying: http://www.hcbravo.org/IntroDataSci/bookdown-notes/tidying-data.html
There is a minor amount of missing data. A small amount of death rates are blank. Looking at the data I can see this is because the corresponding number of deaths is very small. All entities that have an empty death rate attribute have less than 20 deaths. Dividing less than 20 deaths by the population of any state will result in a very small death rate, so I think it is safe to record the death rate as 0 for these occurences.
Follow this link for more information regarding handling missing data: http://www.hcbravo.org/IntroDataSci/bookdown-notes/eda-handling-missing-data.html
Our dataset is almost ready for analysis, we just need a few tweaks:
names(deaths) <- gsub(" ","_",names(deaths))
colnames(deaths)[6] <- "Death_Rate"
deaths[deaths == "CLRD"] <- "Chronic lower respiratory diseases"
deaths[is.na(deaths)] <- 0
deaths<-deaths[!(deaths$State=="United States"),]
deaths <- deaths %>%
select(Year, Cause_Name, State, Deaths, Death_Rate) %>%
arrange(State)
deaths
This dataset is a goldmine for exploratory data analysis. We will be focusing on visualization.
Choropleth maps are are maps where regions are shaded in relation to a data variable. This is perfect for our dataset because we have states as one of our attributes! This map will show the average death rates for all causes per state.
First, I will select the entities in our dataframe that correspond to “All Causes”. I want to look at the average death rates for each state, to do this I first need to group my data by state, and then I need to take the mean of the corresponding death rate. Then I construct my map.
Because this is a state map, DC is not represented. You are able to hover over the state to get the corresponding death rate. As we can see, Mississippi has the highest death rate. This comes as no surprise to me, as Mississippi has high rates of poverty and obesity.
Follow this link for more information on why Mississippi is considered the “unhealthiest state”: https://www.clarionledger.com/story/news/politics/2017/12/12/mississippi-again-unhealthiest-state-country/943720001/
library(plotly)
## Warning: package 'plotly' was built under R version 3.4.4
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(openintro)
## Please visit openintro.org for free statistics materials
##
## Attaching package: 'openintro'
## The following object is masked from 'package:ggplot2':
##
## diamonds
## The following objects are masked from 'package:datasets':
##
## cars, trees
all_causes <- deaths %>%
filter(Cause_Name == "All Causes") %>%
group_by(State) %>%
summarise(Mean_Rate = mean(Death_Rate)) %>%
mutate(Code = state2abbr(State))
all_causes
geog <- list(
scope = 'usa',
projection = list(type = 'albers usa'),
showlakes = FALSE
)
choro <- plot_geo(all_causes, locationmode = 'USA-states') %>%
add_trace(
z = ~Mean_Rate, locations = ~Code,
color = ~Mean_Rate, colors = 'RdPu'
) %>%
colorbar(title = "Death Rate per 100,000") %>%
layout(
title = 'Average Age-Adjusted Death Rates by State',
geo = geog
)
choro
#### Leading Cause of Death
I know this might be easy to figure out from prior knowledge, but let us see what our data tells us. Here we group by cause and year, and then take the mean of the amount of deaths across all states. We then plot the data using a bar graph so we can see which cause has the most average deaths associated with it. It is no surprise that “Diseases of Heart” is the leading cause of death, with cancer coming in second, and stroke coming in third.
library(ggplot2)
causes <- deaths %>%
filter(Cause_Name != "All Causes") %>%
group_by(Cause_Name, Year) %>%
summarize(Mean_Deaths = mean(Deaths)) %>%
ggplot(mapping=aes(x=Cause_Name, y=Mean_Deaths)) +
geom_bar(stat="identity") + theme(axis.text.x = element_text(angle = 27, hjust =1))
causes
As time progresses, we assume human life expectancy will increase with the help of technology and medical advances. Let us first look at the three leading causes we obtained from the previous bar graph and see if they have been decreasing over time. Here we filter the dataframe so we are only looking at the three main causes, and then we calculated the mean death rate. Faceting allows us to look at multiple graphs in one graphic, making it easier to compare and saving space.
As we can see from the graphs, all three leading causes of death have been decreasing in death rate over time. That is good news!
library(ggplot2)
over_time <- deaths %>%
filter(Cause_Name == "Diseases of Heart" | Cause_Name == "Cancer" | Cause_Name == "Stroke") %>%
group_by(Cause_Name, Year) %>%
summarize(Mean_Rate = mean(Death_Rate))
over_time
over_time %>%
ggplot(aes(x=Year, y=Mean_Rate)) +
facet_grid(Cause_Name~.) +
geom_smooth()
## `geom_smooth()` using method = 'loess'
over_time
Now, let’s figure out if death rate has been decreasing over time for all causes. We have a negative slope with a mostly linear relationship, meaning overall death rates have been decreasing over time.
library(ggplot2)
over_time <- deaths %>%
filter(Cause_Name == "All Causes") %>%
group_by(Cause_Name, Year) %>%
summarize(Mean_Rate = mean(Death_Rate))
over_time
over_time %>%
ggplot(aes(x=Year, y=Mean_Rate)) +
geom_smooth()
## `geom_smooth()` using method = 'loess'
over_time
As mentioned before, deaths related to health/medical problems are expected to decrease with technological and medical advancement. But, is there a trend for other causes of death? We will look at three seemingly unpredictable causes of death: homicide, suicide, and unintentional injuries. We will use faceting just like we did with the three leading causes of death. From the graphs we can see that homicide has not shown any significant changes, besides have a very small dip in the past few years. On the other hand, suicide has shown a slight increase, and unintentional injuries have also been increasing overall. Perhaps some of these could be explained by obtaining and analyzing further data sets.
library(ggplot2)
over_time <- deaths %>%
filter(Cause_Name == "Homicide" | Cause_Name == "Suicide" | Cause_Name == "Unintentional Injuries" ) %>%
group_by(Cause_Name, Year) %>%
summarize(Mean_Rate = mean(Death_Rate))
over_time
over_time %>%
ggplot(aes(x=Year, y=Mean_Rate)) +
facet_grid(Cause_Name~.) +
geom_smooth()
## `geom_smooth()` using method = 'loess'
Let’s see if we can learn anything about crime/homicides per state using our dataset. Once again we are creating a Choropleth map. The resulting map is not as saturated as our death rate map, this is because homicides were a fairly small statistic in this dataset. We can conclude that this is not an appropriate dataset for looking at crime/homicide statistics per state.
library(plotly)
library(openintro)
all_causes <- deaths %>%
filter(Cause_Name == "Homicide") %>%
group_by(State) %>%
summarise(Mean_Rate = mean(Death_Rate)) %>%
mutate(Code = state2abbr(State))
all_causes
geog <- list(
scope = 'usa',
projection = list(type = 'albers usa'),
showlakes = FALSE
)
choro <- plot_geo(all_causes, locationmode = 'USA-states') %>%
add_trace(
z = ~Mean_Rate, locations = ~Code,
color = ~Mean_Rate, colors = 'BuPu'
) %>%
colorbar(title = "Death Rate per 100,000") %>%
layout(
title = 'Homicides By State',
geo = geog
)
choro
Linear regression is used to predict the value of a dependent variable using one or more independent variables. It lets us determine if there is a relationship between the variables.
Hypoethesis testing is used to determine if there is enough statistical evidence to support a belief/hypothesis. There is the alternative hypothesis, which is the hypothesis we are trying to prove, and the null hypothesis, which we are trying to disprove.
Follow this link to learn more about hypothesis testing: https://www.utdallas.edu/~scniu/OPRE-6301/documents/Hypothesis_Testing.pdf
For our data, we are trying to show that there is a relationship between the two leading causes of death and year. So, our null hypothesis is that there is no relatioship.
There are two important terms for determining rejection of the null hypothesis: alpha value and p-value. The standard alpha value is .05, which is the probability of rejecting the null hypothesis when it is true. P-values are the chance that a value falls within the normal distribution under the null hypothesis. If our p-value is less than our alpha, then there is a very small chance that the null hypothesis represents the data.
Here we fit a linear regression model and use broom::tidy to gather some statistics. From this we see that we have an extremely small p-value, much less than the alpha of .05, so we can reject the null hypothesis of there being no relationship between year and leading cause of death!
Next we want to examine if year and state possibly have an interaction, aka the death rates are affected by year and state. If this interaction is true, we can then get a more accurate model. However, after fitting the regression model, as we look through the state p-values, none are less than our alpha of .05, so the null hypothesis cannot be rejected, and there seems to be no interaction between state and year.
over_time <- deaths %>%
filter(Cause_Name == "Diseases of Heart" | Cause_Name == "Cancer") %>%
group_by(Cause_Name, Year)
plot <- over_time %>%
ggplot(aes(x=factor(Year), y=Death_Rate)) +
geom_point(aes(color=Cause_Name))
over_time %>%
ggplot(aes(x = Year, y = Death_Rate, color = State)) + geom_point() + geom_smooth(method=lm)
fit <- lm(data=over_time, Death_Rate~Year)
broom::tidy(fit)
lm1 <- lm(data=over_time, Death_Rate~(Year*State))
broom::tidy(lm1)